Red Wine Quality by Caio Lacerda

This dataset contains red wine quality measurements. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Univariate Plots Section

Checking the structure and variables of the dataset.

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality      quality.factor
##  Min.   :3.000   3: 10         
##  1st Qu.:5.000   4: 53         
##  Median :6.000   5:681         
##  Mean   :5.636   6:638         
##  3rd Qu.:6.000   7:199         
##  Max.   :8.000   8: 18

The quality variable was also converted into an ordered factor, quality.factor, which is the thirteenth column in the output above.
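The conversion to an ordered factor can be sketched as follows. This is a minimal example, not necessarily the exact call used in the analysis: it uses only the first ten quality scores shown in the structure output, and declaring the levels as 3 to 8 is an assumption made so that grades absent from this small sample are still represented.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Declare all six observed grades (3 to 8) explicitly, so that grades
# missing from this small sample are still levels of the factor.
quality.factor <- factor(quality, levels = 3:8, ordered = TRUE)

str(quality.factor)
table(quality.factor)
```

With the levels declared this way, the internal codes of the first ten values (3 3 3 4 3 3 3 5 5 3) match the ones listed for quality.factor in the structure output.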

In this section we can see the distribution of the variables in the dataset.

Most wines are rated 5 or 6 (681 and 638 of the 1599 observations, about 82% together), while very few are rated 3 or 8 (10 and 18 wines).
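The share of each grade can be checked directly from the counts in the summary output. The sketch below types those six counts in by hand rather than reading the data file; the variable names are illustrative.

```r
# Counts per quality grade, copied from the summary() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each grade, and the combined share of grades 5 and 6.
shares <- quality_counts / sum(quality_counts)
share_5_or_6 <- sum(shares[c("5", "6")])

round(shares, 3)
round(share_5_or_6, 3)
```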

Alcohol content lies between about 8% and 15% (8.4 to 14.9 in the summary).

For residual sugar we removed some high outliers (the summary shows a maximum of 15.5 against a median of 2.2) to get a clearer view of the shape of the distribution.
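One way to trim high outliers before plotting is to drop rows above an upper percentile. The sketch below is only an illustration: the 95th percentile cutoff is a hypothetical choice, not the threshold used in this analysis, and the data are the first ten residual.sugar values from the structure output.

```r
# First ten residual.sugar values, copied from the str() output above.
rw_sample <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
)

# Hypothetical cutoff: keep only rows below the 95th percentile.
cutoff <- quantile(rw_sample$residual.sugar, probs = 0.95)
rw_trimmed <- subset(rw_sample, residual.sugar < cutoff)

# Compare the distribution before and after trimming.
summary(rw_sample$residual.sugar)
summary(rw_trimmed$residual.sugar)
```

In this sample only the value 6.1 is dropped, which brings the maximum much closer to the bulk of the data.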

For citric acid, most values lie between 0 and 0.8.

For pH, values lie between about 2.5 and 4.

Univariate Analysis

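The dataset has 1599 observations of 12 variables: 11 physicochemical measurements plus the quality rating. The structure output above lists 13 columns because it includes the derived quality.factor.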

The features that first come to mind as likely influences on the quality rating are alcohol, residual.sugar, chlorides and perhaps citric.acid, which is described as adding “‘freshness’ and flavor to wines”.

The other features may well have underlying significance, but since they have no described effect on taste, they are not analysed at this stage.

Residual sugar has very high outliers. The bulk of the data has quality values of 5 or 6. Citric acid is fairly evenly distributed between 0 and 0.5. The quality variable was converted into an ordered factor, since it takes a fixed set of ordered grades.

Bivariate Plots Section

Variables x Quality plots

Here I plotted against quality the variables examined in the univariate analysis, which I believe are the most likely to be correlated with the quality rating.

Of the four plots, only alcohol shows a visible change in distribution and mean across the quality ratings.
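The comparison across quality ratings can also be made numerically by summarising a variable within each rating. The sketch below uses only the first ten rows from the structure output, so the numbers are illustrative and do not reproduce the full-data boxplots; the object names are hypothetical.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
rw_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median and mean alcohol within each quality rating, the same
# quantities a boxplot with a mean marker would display.
median_alcohol <- tapply(rw_sample$alcohol, rw_sample$quality, median)
mean_alcohol <- tapply(rw_sample$alcohol, rw_sample$quality, mean)

rbind(median = median_alcohol, mean = mean_alcohol)
```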

GGPairs for all variables

In this section I plotted all the variables together, along with their correlations, to check whether any other features are relevant or related to each other.

This revealed relationships between pH and the acidity variables, which makes sense. I then looked more closely at the strongest of these relationships to check for any oddities.

Plots pH/Acidity features

## [1] 0.6717034
## [1] -0.6829782
## [1] -0.5419041

These plots and coefficients show that the variables are reasonably well distributed and correlated with each other.
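Coefficients like the three printed above come from pairwise calls to cor(), which computes Pearson correlation by default. The sketch below assumes the three variables involved are fixed.acidity, citric.acid and pH, and uses only the first ten rows from the structure output, so its values differ from the full-data ones.

```r
# First ten rows of the three variables, copied from the str() output above.
rw_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations (the default method of cor()).
cor_fixed_citric <- cor(rw_sample$fixed.acidity, rw_sample$citric.acid)
cor_fixed_ph     <- cor(rw_sample$fixed.acidity, rw_sample$pH)
cor_citric_ph    <- cor(rw_sample$citric.acid, rw_sample$pH)

c(fixed_citric = cor_fixed_citric, fixed_pH = cor_fixed_ph, citric_pH = cor_citric_ph)
```

Even in this small sample the signs agree with the full-data output: the two acidity variables move together, and both move against pH.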

Bivariate Analysis

As the boxplots above suggested, alcohol is the only variable whose correlation with quality approaches 0.5, at 0.476. The second strongest correlation is negative: volatile.acidity, at -0.391.
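A quick way to rank features by their correlation with quality is to apply cor() to each candidate column and sort by absolute value. This is a sketch on the first ten rows from the structure output only, so the ranking it prints is not the full-data ranking quoted above; the choice of four predictors and the object names are illustrative.

```r
# First ten rows of four candidate predictors and the target,
# copied from the str() output above.
rw_sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Correlation of every predictor with quality.
predictors <- setdiff(names(rw_sample), "quality")
cor_with_quality <- sapply(rw_sample[predictors], cor, y = rw_sample$quality)

# Strongest relationships first, regardless of sign.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
round(ranked, 3)
```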

Among the other variables, pH shows negative, roughly linear relationships with fixed.acidity and citric.acid. This makes sense, since pH is a measure of acidity: the lower the pH, the higher the acidity. The quantity of citric.acid also rises with fixed.acidity. The strongest of these correlations are 0.672 (citric.acid and fixed.acidity) and -0.683 (pH and fixed.acidity).

Multivariate Plots Section

Next I looked at volatile acidity and alcohol against quality. The ggpairs plot showed earlier that volatile acidity has one of the stronger correlations with quality, so it could contribute to a predictive model. Bar a few outliers, a tendency is visible. The correlation is negative (-0.391), which fits the description of this feature: it measures the “vinegar taste” of the wine, and the more of it a wine has, the more unpleasant it is said to taste.

I also plotted sulphates, which, compared with the other features, has a good correlation with quality. This is an inorganic compound said not to influence taste, but the plot shows a slight correlation.

We can also check the relationship between the three variables examined before, pH and the acidity variables.

With all this in mind, we can try to build a model with the best variables available.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol + volatile.acidity, data = rw)
## m2: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = rw)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = rw)
## 
## ==============================================================
##                          m1            m2            m3       
## --------------------------------------------------------------
##   (Intercept)           3.095***      2.611***      2.646***  
##                        (0.184)       (0.196)       (0.201)    
##   alcohol               0.314***      0.309***      0.309***  
##                        (0.016)       (0.016)       (0.016)    
##   volatile.acidity     -1.384***     -1.221***     -1.265***  
##                        (0.095)       (0.097)       (0.113)    
##   sulphates                           0.679***      0.696***  
##                                      (0.101)       (0.103)    
##   citric.acid                                      -0.079     
##                                                    (0.104)    
## --------------------------------------------------------------
##   R-squared             0.317         0.336         0.336     
##   adj. R-squared        0.316         0.335         0.334     
##   sigma                 0.668         0.659         0.659     
##   F                   370.379       268.912       201.777     
##   p                     0.000         0.000         0.000     
##   Log-likelihood    -1621.814     -1599.384     -1599.093     
##   Deviance            711.796       692.105       691.852     
##   AIC                3251.628      3208.768      3210.186     
##   BIC                3273.136      3235.654      3242.448     
##   N                  1599          1599          1599         
## ==============================================================

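The model does not perform very well: even the best specification explains only about a third of the variance (R-squared of 0.336), because the features are not strongly descriptive of the quality ratings. Adding citric.acid in m3 brings no improvement, since its coefficient is not significant and the AIC rises slightly compared with m2.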
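The nested-model comparison can be sketched in base R with lm(), summary() and AIC(). The formulas follow m1 and m2 from the table above, but the data here are only the first ten rows from the structure output (named rw_sample to keep them distinct from the full rw data frame), so the coefficients and fit statistics will not match the table.

```r
# First ten rows of the model variables, copied from the str() output above.
rw_sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Nested linear models, adding one predictor at a time.
m1 <- lm(quality ~ alcohol + volatile.acidity, data = rw_sample)
m2 <- lm(quality ~ alcohol + volatile.acidity + sulphates, data = rw_sample)

# R-squared never decreases when a predictor is added, so AIC
# (which penalises extra parameters) is the fairer comparison.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
aic <- AIC(m1, m2)

r2
aic
```

A lower AIC indicates the preferred model, which is how m2 can be read as the best of the three specifications in the table above.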

Multivariate Analysis

The plot above shows the relationship between pH and the acidity variables clearly. I also plotted volatile acidity and sulphates together with alcohol against the quality ratings, to check the correlations found earlier. I then built a linear model from some of the variables analysed. The model explains only part of the variance in the data, with an R-squared of just 0.336, so quality is hard to predict from the variables we have.


Final Plots and Summary

Plot One

Description One

This first plot helped me verify that alcohol is correlated with the quality score. The median and mean show a clear upward tendency as quality increases, so this is a good feature to consider for a model in comparison with all the others.

Plot Two

Description Two

This plot shows one of the strongest relationships in the dataset: the more citric acid, the lower the pH, which makes sense given the composition of the wine.

Plot Three

Description Three

This plot brings together the three features that best describe wine quality; they have the strongest correlations with quality in the dataset, so it makes sense to see them together. Higher volatile acidity goes with lower quality (its correlation and its model coefficient are both negative), which is consistent with its description as the process of the wine “turning into vinegar”.


Reflection

This was my first time exploring data on my own, and this dataset was a nice first challenge. The structure and values of the dataset are quite simple, so exploring the variables individually was not that difficult. The challenging part was discovering that the variables are not clearly descriptive of the target: correlations were low for most of them, so the analysis was not as straightforward as I expected. I have grown accustomed to the R language; it is simple to work with, and even though my knowledge of it is limited to what I learned in the course, it was comfortable to use throughout this analysis. For a future analysis, it would be interesting to find out whether a combination of the acidity components is more descriptive of quality, and whether the quality rating could be broken down into more descriptive taste features, so that we could see which characteristics, according to the rater, drove the rating and check more clearly whether they agree with the component data.